Instruction fetch apparatus for wide issue processors and method of operation

ABSTRACT

There is disclosed a data processor containing an instruction issue unit that efficiently transfers instruction bundles from an instruction cache to an instruction pipeline. The data processor comprises 1) an instruction pipeline comprising N processing stages; and 2) an instruction issue unit for fetching into the instruction pipeline instructions fetched from the instruction cache, each of the fetched instructions comprising from one to S syllables. The instruction issue unit comprises: a) a first buffer comprising S storage locations for storing up to S syllables associated with the fetched instructions, each of the S storage locations storing one of the one to S syllables of each fetched instruction; b) a second buffer comprising S storage locations for storing up to S syllables associated with the fetched instructions, each of the S storage locations storing one of the one to S syllables of each fetched instruction; and c) a controller for determining if a first one of the S storage locations in the first buffer is full, wherein the controller, in response to such a determination, stores a corresponding syllable of an incoming fetched instruction in a corresponding one of the S storage locations in the second buffer.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present invention is related to those disclosed in the following United States Patent Applications:

[0001] 1) Ser. No. [Docket No. 00-BN-051], filed concurrently herewith, entitled “SYSTEM AND METHOD FOR EXECUTING VARIABLE LATENCY LOAD OPERATIONS IN A DATA PROCESSOR”;

[0002] 2) Ser. No. [Docket No. 00-BN-052], filed concurrently herewith, entitled “PROCESSOR PIPELINE STALL APPARATUS AND METHOD OF OPERATION”;

[0003] 3) Ser. No. [Docket No. 00-BN-053], filed concurrently herewith, entitled “CIRCUIT AND METHOD FOR HARDWARE-ASSISTED SOFTWARE FLUSHING OF DATA AND INSTRUCTION CACHES”;

[0004] 4) Ser. No. [Docket No. 00-BN-054], filed concurrently herewith, entitled “CIRCUIT AND METHOD FOR SUPPORTING MISALIGNED ACCESSES IN THE PRESENCE OF SPECULATIVE LOAD INSTRUCTIONS”;

[0005] 5) Ser. No. [Docket No. 00-BN-055], filed concurrently herewith, entitled “BYPASS CIRCUITRY FOR USE IN A PIPELINED PROCESSOR”;

[0006] 6) Ser. No. [Docket No. 00-BN-056], filed concurrently herewith, entitled “SYSTEM AND METHOD FOR EXECUTING CONDITIONAL BRANCH INSTRUCTIONS IN A DATA PROCESSOR”;

[0007] 7) Ser. No. [Docket No. 00-BN-057], filed concurrently herewith, entitled “SYSTEM AND METHOD FOR ENCODING CONSTANT OPERANDS IN A WIDE ISSUE PROCESSOR”;

[0008] 8) Ser. No. [Docket No. 00-BN-058], filed concurrently herewith, entitled “SYSTEM AND METHOD FOR SUPPORTING PRECISE EXCEPTIONS IN A DATA PROCESSOR HAVING A CLUSTERED ARCHITECTURE”;

[0009] 9) Ser. No. [Docket No. 00-BN-059], filed concurrently herewith, entitled “CIRCUIT AND METHOD FOR INSTRUCTION COMPRESSION AND DISPERSAL IN WIDE-ISSUE PROCESSORS”; and

[0010] 10) Ser. No. [Docket No. 00-BN-066], filed concurrently herewith, entitled “SYSTEM AND METHOD FOR REDUCING POWER CONSUMPTION IN A DATA PROCESSOR HAVING A CLUSTERED ARCHITECTURE”.

[0011] The above applications are commonly assigned to the assignee of the present invention. The disclosures of these related patent applications are hereby incorporated by reference for all purposes as if fully set forth herein.

TECHNICAL FIELD OF THE INVENTION

[0012] The present invention is generally directed to data processors and, more specifically, to an efficient instruction fetch engine for use in a wide issue data processor.

BACKGROUND OF THE INVENTION

[0013] The demand for high performance computers requires that state-of-the-art microprocessors execute instructions in the minimum amount of time. A number of different approaches have been taken to decrease instruction execution time, thereby increasing processor throughput. One way to increase processor throughput is to use a pipeline architecture in which the processor is divided into separate processing stages that form the pipeline. Instructions are broken down into elemental steps that are executed in different stages in an assembly line fashion.

[0014] A pipelined processor is capable of executing several different machine instructions concurrently. This is accomplished by breaking down the processing steps for each instruction into several discrete processing phases, each of which is executed by a separate pipeline stage. Hence, each instruction must pass sequentially through each pipeline stage in order to complete its execution. In general, a given instruction is processed by only one pipeline stage at a time, with one clock cycle being required for each stage. Since instructions use the pipeline stages in the same order and typically only stay in each stage for a single clock cycle, an N-stage pipeline is capable of simultaneously processing N instructions. When filled with instructions, a processor with N pipeline stages completes one instruction each clock cycle.

[0015] The execution rate of an N-stage pipeline processor is theoretically N times faster than that of an equivalent non-pipelined processor. A non-pipelined processor is a processor that completes execution of one instruction before proceeding to the next instruction. Typically, pipeline overheads and other factors somewhat decrease the execution rate advantage that a pipelined processor has over a non-pipelined processor.

[0016] An exemplary seven stage processor pipeline may consist of an address generation stage, an instruction fetch stage, a decode stage, a read stage, a pair of execution (E1 and E2) stages, and a write (or write-back) stage. In addition, the processor may have an instruction cache that stores program instructions for execution, a data cache that temporarily stores data operands that otherwise are stored in processor memory, and a register file that also temporarily stores data operands.

[0017] The address generation stage generates the address of the next instruction to be fetched from the instruction cache. The instruction fetch stage fetches an instruction for execution from the instruction cache and stores the fetched instruction in an instruction buffer. The decode stage takes the instruction from the instruction buffer and decodes the instruction into a set of signals that can be used directly by the subsequent pipeline stages. The read stage fetches required operands from the data cache or registers in the register file. The E1 and E2 stages perform the actual program operation (e.g., add, multiply, divide, and the like) on the operands fetched by the read stage and generate the result. The write stage then writes the result generated by the E1 and E2 stages back into the data cache or the register file.

[0018] Assuming that each pipeline stage completes its operation in one clock cycle, the exemplary seven stage processor pipeline takes seven clock cycles to process one instruction. As previously described, once the pipeline is full, an instruction can theoretically be completed every clock cycle.
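
To make the fill behavior concrete, the cycle count can be sketched as follows (a minimal Python illustration added for this edition, not part of the original disclosure; the function name is our own):

    # Cycles needed for M instructions in an N-stage pipeline, assuming one
    # stage per clock cycle and no stalls: the first instruction takes N
    # cycles to drain, and each later instruction completes one cycle after
    # its predecessor.
    def pipeline_cycles(n_stages: int, n_instructions: int) -> int:
        return n_stages + (n_instructions - 1)

    # For the exemplary seven stage pipeline: one instruction takes 7 cycles,
    # while 100 instructions take only 7 + 99 = 106 cycles in total.
    assert pipeline_cycles(7, 1) == 7
    assert pipeline_cycles(7, 100) == 106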

[0019] The throughput of a processor also is affected by the size of the instruction set executed by the processor and the resulting complexity of the instruction decoder. Large instruction sets require large, complex decoders in order to maintain a high processor throughput. However, large complex decoders tend to increase power dissipation, die size and the cost of the processor. The throughput of a processor also may be affected by other factors, such as exception handling, data and instruction cache sizes, multiple parallel instruction pipelines, and the like. All of these factors increase or at least maintain processor throughput by means of complex and/or redundant circuitry that simultaneously increases power dissipation, die size and cost.

[0020] In many processor applications, the increased cost, increased power dissipation, and increased die size are tolerable, such as in personal computers and network servers that use x86-based processors. These types of processors include, for example, Intel Pentium™ processors and AMD Athlon™ processors.

[0021] However, in many applications it is essential to minimize the size, cost, and power requirements of a data processor. This has led to the development of processors that are optimized to meet particular size, cost and/or power limits. For example, the recently developed Transmeta Crusoe™ processor greatly reduces the amount of power consumed by the processor when executing most x86-based programs. This is particularly useful in laptop computer applications. Other types of data processors may be optimized for use in consumer appliances (e.g., televisions, video players, radios, digital music players, and the like) and office equipment (e.g., printers, copiers, fax machines, telephone systems, and other peripheral devices). The general design objectives for data processors used in consumer appliances and office equipment are the minimization of cost and complexity of the data processor.

[0022] Many pipelined processors are implemented as very large instruction word (VLIW) devices that allow the parallel execution of multiple instructions in two or more instruction pipelines. A common problem in VLIW processors is the complexity of the fetch and instruction alignment circuitry. The problem arises because variable numbers of instructions are executed each cycle, making it difficult to decide where to fetch from next. Some prior art solutions rely on extremely simple algorithms that suffer more stall cycles than necessary. Other prior art solutions use a single point of size encoding (e.g., IA64), which results in a more complex instruction decode circuit and a less flexible issue strategy.

[0023] Therefore, there is a need in the art for improved pipeline architectures that allow efficient implementation of very large instruction words (VLIW) in a data processor. In particular, there is a need in the art for an instruction fetch engine that can fetch variable-length very large instruction words from an instruction cache and issue the instructions into an execution pipeline with minimum delay. More particularly, there is a need in the art for an instruction fetch engine that can determine when all portions of a variable-length VLIW have been fetched from an instruction cache and can issue the complete instruction into an execution pipeline with minimum delay.

SUMMARY OF THE INVENTION

[0024] To address the above-discussed deficiencies of the prior art, it is a primary object of the present invention to provide a data processor that implements an instruction issue unit that efficiently transfers instruction bundles from an instruction cache to an instruction execution pipeline with minimum delay. According to an advantageous embodiment of the present invention, the data processor comprises 1) an instruction execution pipeline comprising N processing stages; and 2) an instruction issue unit capable of fetching into the instruction execution pipeline instructions fetched from an instruction cache associated with the data processor, each of the fetched instructions comprising from one to S syllables. The instruction issue unit comprises: a) a first buffer comprising S storage locations capable of receiving and storing the one to S syllables associated with the fetched instructions, each of the S storage locations capable of storing one of the one to S syllables of each fetched instruction; b) a second buffer comprising S storage locations capable of receiving and storing the one to S syllables associated with the fetched instructions, each of the S storage locations capable of storing one of the one to S syllables of each fetched instruction; and c) a controller capable of determining if a first one of the S storage locations in the first buffer is full, wherein the controller, in response to a determination that the first one of the S storage locations is full, causes a corresponding syllable in an incoming fetched instruction to be stored in a corresponding one of the S storage locations in the second buffer.

[0025] According to one embodiment of the present invention, the value of S is four.

[0026] According to another embodiment of the present invention, the value of S is eight.

[0027] According to still another embodiment of the present invention, the value of S is a multiple of four.

[0028] According to yet another embodiment of the present invention, each of the one to S syllables comprises 32 bits.

[0029] According to a further embodiment of the present invention, each of the one to S syllables comprises 16 bits.

[0030] According to a still further embodiment of the present invention, each of the one to S syllables comprises 64 bits.

[0031] According to yet a further embodiment of the present invention, the controller is capable of determining when all of the syllables in one of the fetched instructions are present in the first buffer, wherein the controller, in response to a determination that all of the syllables are present, causes all of the syllables to be transferred from the first buffer to the instruction execution pipeline.

[0032] In one embodiment of the present invention, the controller is capable of determining if a syllable in the first one of the S storage locations in the first buffer has been transferred from the first buffer to the instruction pipeline, wherein the controller, in response to a determination that the syllable has been transferred, causes the corresponding syllable stored in the corresponding one of the S storage locations in the second buffer to be transferred to the first one of the S storage locations in the first buffer.

[0033] In another embodiment of the present invention, the data processor further comprises a switching circuit controlled by the controller and operable to transfer syllables from the second buffer to the first buffer.

[0034] The foregoing has outlined rather broadly the features and technical advantages of the present invention so that those skilled in the art may better understand the detailed description of the invention that follows. Additional features and advantages of the invention will be described hereinafter that form the subject of the claims of the invention. Those skilled in the art should appreciate that they may readily use the conception and the specific embodiment disclosed as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the invention in its broadest form.

[0035] Before undertaking the DETAILED DESCRIPTION OF THE INVENTION below, it may be advantageous to set forth definitions of certain words and phrases used throughout this patent document: the terms “include” and “comprise,” as well as derivatives thereof, mean inclusion without limitation; the term “or” is inclusive, meaning and/or; the phrases “associated with” and “associated therewith,” as well as derivatives thereof, may mean to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, or the like; and the term “controller” means any device, system or part thereof that controls at least one operation; such a device may be implemented in hardware, firmware or software, or some combination of at least two of the same. It should be noted that the functionality associated with any particular controller may be centralized or distributed, whether locally or remotely. Definitions for certain words and phrases are provided throughout this patent document; those of ordinary skill in the art should understand that in many, if not most instances, such definitions apply to prior, as well as future, uses of such defined words and phrases.

BRIEF DESCRIPTION OF THE DRAWINGS

[0036] For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, wherein like numbers designate like objects, and in which:

[0037] FIG. 1 is a block diagram of a processing system that contains a data processor in accordance with the principles of the present invention;

[0038] FIG. 2 illustrates the exemplary data processor in greater detail according to one embodiment of the present invention;

[0039] FIG. 3 illustrates a cluster in the exemplary data processor according to one embodiment of the present invention;

[0040] FIG. 4 illustrates the operational stages of the exemplary data processor according to one embodiment of the present invention;

[0041] FIG. 5 is a block diagram illustrating selected portions of an instruction fetch apparatus according to one embodiment of the present invention;

[0042] FIG. 6 is a block diagram illustrating the contents of the instruction cache in the exemplary data processor according to one embodiment of the present invention; and

[0043] FIGS. 7A-7D are block diagrams illustrating the flow of instructions through the instruction issue buffers according to one embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

[0044] FIGS. 1 through 7, discussed below, and the various embodiments used to describe the principles of the present invention in this patent document are by way of illustration only and should not be construed in any way to limit the scope of the invention. Those skilled in the art will understand that the principles of the present invention may be implemented in any suitably arranged data processor.

[0045] FIG. 1 is a block diagram of processing system 10, which contains data processor 100 in accordance with the principles of the present invention. Data processor 100 comprises processor core 105 and N memory-mapped peripherals interconnected by system bus 120. The N memory-mapped peripherals include exemplary memory-mapped peripherals 111-114, which are arbitrarily labeled Memory-Mapped Peripheral 1, Memory-Mapped Peripheral 2, Memory-Mapped Peripheral 3, and Memory-Mapped Peripheral N. Processing system 10 also comprises main memory 130. In an advantageous embodiment of the present invention, main memory 130 may be subdivided into program memory 140 and data memory 150.

[0046] The cost and complexity of data processor 100 is minimized by excluding from processor core 105 complex functions that may be implemented by one or more of memory-mapped peripherals 111-114. For example, memory-mapped peripheral 111 may be a video codec and memory-mapped peripheral 112 may be an audio codec. Similarly, memory-mapped peripheral 113 may be used to control cache flushing. The cost and complexity of data processor 100 is further minimized by implementing extremely simple exception behavior in processor core 105, as explained below in greater detail.

[0047] Processing system 10 is shown in a general level of detail because it is intended to represent any one of a wide variety of electronic devices, particularly consumer appliances. For example, processing system 10 may be a printer rendering system for use in a conventional laser printer. Processing system 10 also may represent selected portions of the video and audio compression-decompression circuitry of a video playback system, such as a video cassette recorder or a digital versatile disk (DVD) player. In another alternative embodiment, processing system 10 may comprise selected portions of a cable television set-top box or a stereo receiver. The memory-mapped peripherals and a simplified processor core reduce the cost of data processor 100 so that it may be used in such price-sensitive consumer appliances.

[0048] In the illustrated embodiment, memory-mapped peripherals 111-114 are shown disposed within data processor 100, and program memory 140 and data memory 150 are shown external to data processor 100. It will be appreciated by those skilled in the art that this particular configuration is shown by way of illustration only and should not be construed so as to limit the scope of the present invention in any way. In alternative embodiments of the present invention, one or more of memory-mapped peripherals 111-114 may be externally coupled to data processor 100. Similarly, in another embodiment of the present invention, one or both of program memory 140 and data memory 150 may be disposed on-chip in data processor 100.

[0049] FIG. 2 is a more detailed block diagram of exemplary data processor 100 according to one embodiment of the present invention. Data processor 100 comprises instruction fetch cache and expansion unit (IFCEXU) 210, which contains instruction cache 215, and a plurality of clusters, including exemplary clusters 220-222. Exemplary clusters 220-222 are labeled Cluster 0, Cluster 1 and Cluster 2, respectively. Data processor 100 also comprises core memory controller 230 and interrupt and exception controller 240.

[0050] A fundamental object of the design of data processor 100 is to exclude from the core of data processor 100 most of the functions that can be implemented using memory-mapped peripherals external to the core of data processor 100. By way of example, in an exemplary embodiment of the present invention, cache flushing may be efficiently accomplished using software in conjunction with a small memory-mapped device. Another object of the design of data processor 100 is to implement a statically scheduled instruction pipeline with an extremely simple exception behavior.

[0051] Clusters 220-222 are basic execution units that comprise one or more arithmetic units, a register file, an interface to core memory controller 230, including a data cache, and an inter-cluster communication interface. In an exemplary embodiment of the present invention, the core of data processor 100 may comprise only a single cluster, such as exemplary cluster 220.

[0052] Because conventional processor cores can execute multiple simultaneously issued operations, the traditional word “instruction” is hereby defined with greater specificity. For the purposes of this disclosure, the following terminology is adopted. An “instruction” or “instruction bundle” is a group of simultaneously issued operations encoded as “instruction syllables.” Each instruction syllable is encoded as a single machine word. Each of the operations constituting an instruction bundle may be encoded as one or more instruction syllables. Hereafter, the present disclosure may use the shortened forms “instruction” and “bundle” interchangeably and may use the shortened form “syllable.” In an exemplary embodiment of the present invention, each instruction bundle consists of 1 to 4 instruction syllables. Flow control operations, such as branch or call, are encoded in single instruction syllables.
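
This terminology can be captured in a short behavioral model (an illustrative sketch only; the class and field names are our own and do not appear in the disclosure):

    from dataclasses import dataclass

    @dataclass
    class Bundle:
        """An instruction bundle: a group of simultaneously issued operations."""
        syllables: list[int]  # each syllable is one 32-bit machine word

        def __post_init__(self):
            # Per the exemplary embodiment, a bundle holds 1 to 4 syllables.
            assert 1 <= len(self.syllables) <= 4
            assert all(0 <= s < 2**32 for s in self.syllables)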

[0053] FIG. 3 is a more detailed block diagram of cluster 220 in data processor 100 according to one embodiment of the present invention. Cluster 220 comprises instruction buffer 305, register file 310, program counter (PC) and branch unit 315, instruction decoder 320, load store unit 325, data cache 330, integer units 341-344, and multipliers 351-352. Cluster 220 is implemented as an instruction pipeline.

[0054] Instructions are issued to an operand read stage associated with register file 310 and then propagated to the execution units (i.e., integer units 341-344, multipliers 351-352). Cluster 220 accepts one bundle comprising one to four syllables in each cycle. The bundle may consist of any combination of four integer operations, two multiplication operations, one memory operation (i.e., read or write) and one branch operation. Operations that require long immediates (constants) require two syllables.
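
The stated per-cycle resource limits can be expressed as a simple legality check (an illustrative sketch; the operation category names are our own shorthand, and long-immediate operations, which occupy two syllables, are not modeled):

    from collections import Counter

    # Per-cycle limits stated above: up to four integer operations, two
    # multiplications, one memory operation and one branch operation.
    LIMITS = {"int": 4, "mul": 2, "mem": 1, "branch": 1}

    def bundle_mix_is_legal(op_kinds: list[str]) -> bool:
        # A bundle carries at most four syllables, hence at most four operations.
        if len(op_kinds) > 4:
            return False
        counts = Counter(op_kinds)
        return all(counts[k] <= LIMITS.get(k, 0) for k in counts)

    assert bundle_mix_is_legal(["int", "int", "mul", "branch"])
    assert not bundle_mix_is_legal(["mem", "mem"])  # only one memory op per cycle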

[0055] In specifying a cluster, it is assumed that no instruction bits are used to associate operations with functional units. For example, arithmetic or load/store operations may be placed in any of the four words encoding the operations for a single cycle. This may require imposing some addressing alignment restrictions on multiply operations and long immediates (constants).

[0056] The following describes the architectural (programmer visible) status of the core of data processor 100. One design objective of data processor 100 is to minimize the architectural status. All non-user visible status information resides in a memory map, in order to reduce the number of special instructions required to access such information.

[0057] Program Counter

[0058] In an exemplary embodiment of the present invention, the program counter (PC) in program counter and branch unit 315 is a 32-bit byte address pointing to the beginning of the current instruction bundle in memory. The two least significant bits (LSBs) of the program counter are always zero. In operations that assign a value to the program counter, the two LSBs of the assigned value are ignored.
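
This alignment rule amounts to masking the two LSBs on every PC write (a minimal sketch; the function name is hypothetical):

    # The PC is a 32-bit byte address whose two LSBs are always zero; any
    # value assigned to it therefore has its two LSBs ignored.
    def write_pc(value: int) -> int:
        return value & 0xFFFFFFFC

    assert write_pc(0x1003) == 0x1000  # the assigned LSBs are discarded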

[0059] Register File 310

[0060] In an exemplary embodiment, register file 310 contains 64 words of 32 bits each. Reading Register 0 (i.e., R0) always returns the value zero.
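
The R0-reads-as-zero rule can be sketched as follows (illustrative only; the class name is our own):

    class RegisterFile:
        """64 words of 32 bits each; R0 always reads as zero."""
        def __init__(self):
            self.regs = [0] * 64

        def read(self, n: int) -> int:
            return 0 if n == 0 else self.regs[n]

        def write(self, n: int, value: int) -> None:
            self.regs[n] = value & 0xFFFFFFFF

    rf = RegisterFile()
    rf.write(0, 123)
    assert rf.read(0) == 0  # reading R0 returns zero regardless of writes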

[0061] Link Register

[0062] Register 63 (i.e., R63) is used to address the link register by the call and return instructions. The link register (LR) is a slaved copy of the architecturally most recent update to R63. R63 can be used as a normal register between call and return instructions. The link register is updated only by writes to R63 and the call instruction. At times, the fact that the link register is a copy of R63 and not R63 itself may be visible to the programmer. This is because the link register and R63 get updated at different times in the pipeline. Typically, this occurs in the following cases:

[0063] 1) ICALL and IGOTO instructions—Since these instructions are executed in the decode stage, these operations require that R63 be stable. Thus, R63 must not be modified in the instruction bundle preceding one of these operations. Otherwise, unpredictable results may occur in the event of an interrupt; and

[0064] 2) An interrupt or exception may update the link register incorrectly. Thus, all interrupt and exception handlers must explicitly write R63 prior to using the link register through the execution of an RFI, ICALL or IGOTO instruction. This requirement can be met with a simple MOV instruction from R63 to R63.

[0065] Branch Bit File

[0066] The branch architecture of data processor 100 uses a set of eight (8) branch bit registers (i.e., B0 through B7) that may be read or written independently. In an exemplary embodiment of the present invention, data processor 100 requires at least one instruction to be executed between writing a branch bit and using the result in a conditional branch operation.

[0067] Control Registers

[0068] A small number of memory-mapped control registers are part of the architectural state of data processor 100. These registers include support for interrupts and exceptions, and memory protection.

[0069] The core of data processor 100 is implemented as a pipeline that requires minimal instruction decoding in the early pipeline stages. One design objective of the pipeline of data processor 100 is that it support precise interrupts and exceptions. Data processor 100 meets this objective by updating architecturally visible state information only during a single write stage. To accomplish this, data processor 100 makes extensive use of register bypassing circuitry to minimize the performance impact of meeting this requirement.

[0070] FIG. 4 is a block diagram illustrating the operational stages of pipeline 400 in exemplary data processor 100 according to one embodiment of the present invention. In the illustrated embodiment, the operational stages of data processor 100 are address generation stage 401, fetch stage 402, decode stage 403, read stage 404, first execution (E1) stage 405, second execution (E2) stage 406 and write stage 407.

[0071] Address Generation Stage 401 and Fetch Stage 402

[0072] Address generation stage 401 comprises a fetch address generator 410 that generates the address of the next instruction to be fetched from instruction cache 215. Fetch address generator 410 receives inputs from exception generator 430 and program counter and branch unit 315. Fetch address generator 410 generates an instruction fetch address (FADDR) that is applied to instruction cache 215 in fetch stage 402 and to an instruction protection unit (not shown) that generates an exception if a protection violation is found. Any exception generated in fetch stage 402 is postponed to write stage 407. Instruction buffer 305 in fetch stage 402 receives instructions as 128-bit wide words from instruction cache 215 and the instructions are dispatched to the cluster.

[0073] Decode Stage 403

[0074] Decode stage 403 comprises instruction decode block 415 and program counter (PC) and branch unit 315. Instruction decode block 415 receives instructions from instruction buffer 305 and decodes the instructions into a group of control signals that are applied to the execution units in E1 stage 405 and E2 stage 406. Program counter and branch unit 315 evaluates branches detected within the 128-bit wide words. A taken branch incurs a one-cycle delay, and the instruction incorrectly fetched while the branch instruction is evaluated is discarded.

[0075] Read Stage 404

[0076] In read stage 404, operands are generated by register file access, bypass and immediate (constant) generation block 420. The sources for operands are the register files, the constants (immediates) assembled from the instruction bundle, and any results bypassed from operations in later stages in the instruction pipeline.

[0077] E1 Stage 405 and E2 Stage 406

[0078] The instruction execution phase of data processor 100 is implemented as two stages, E1 stage 405 and E2 stage 406, to allow two-cycle cache access operations and two-cycle multiplication operations. Exemplary multiplier 351 is illustrated straddling the boundary between E1 stage 405 and E2 stage 406 to indicate a two-cycle multiplication operation. Similarly, load store unit 325 and data cache 330 are illustrated straddling the boundary between E1 stage 405 and E2 stage 406 to indicate a two-cycle cache access operation. Integer operations are performed by integer units, such as IU 341, in E1 stage 405. Exceptions are generated by exception generator 430 in E2 stage 406 and write stage 407.

[0079] Results from fast operations are made available after E1 stage 405 through register bypassing operations. An important architectural requirement of data processor 100 is that if the results of an operation may be ready after E1 stage 405, then the results are always ready after E1 stage 405. In this manner, the visible latency of operations in data processor 100 is fixed.

[0080] Write Stage 407

[0081] At the start of write stage 407, any pending exceptions are raised and, if no exceptions are raised, results are written by register write back and bypass block 440 into the appropriate register file and/or data cache location. In data processor 100, write stage 407 is the “commit point,” and operations reaching write stage 407 in the instruction pipeline that are not “excepted” are considered completed. Previous stages (i.e., address generation, fetch, decode, read, E1, E2) are temporally prior to the commit point. Therefore, operations in address generation stage 401, fetch stage 402, decode stage 403, read stage 404, E1 stage 405 and E2 stage 406 are flushed when an exception occurs, and the exception is acted upon in write stage 407.

[0082] As the above description indicates, data processor 100 is a very large instruction word (VLIW) device that allows the parallel execution of multiple instructions in two or more instruction pipelines in clusters 220-222. In an exemplary embodiment, instruction cache 215 comprises cache lines that are 512 bits (i.e., 64 bytes) long. Each syllable (i.e., the smallest instruction size) comprises 32 bits (i.e., 4 bytes), such that a cache line comprises 16 syllables. Each instruction syllable is encoded as a single 32-bit machine word.

[0083] Instructions are fetched from instruction cache 215 in groups of four syllables (i.e., 128 bits). A complete instruction may comprise one, two, three or four syllables. The fetched syllables are issued into one of four issue lanes leading into the instruction pipeline. The four issue lanes are referred to as Issue Lane 0, Issue Lane 1, Issue Lane 2, and Issue Lane 3. Because instructions are of variable length and because a branch instruction may fetch instructions starting at any point in instruction cache 215, there is no guarantee that all of the syllables in an instruction will be fetched in the same cache access. There also is no guarantee that a particular syllable in an instruction will be aligned to a particular issue lane in clusters 220-222.

[0084] In order to minimize the amount of delay incurred in fetching instructions, the present invention implements an instruction issue unit comprising a sequence of instruction issue unit buffers (IIUBs) that temporarily store the syllables of an instruction until all syllables of the instruction are present. The complete instruction, consisting of one to four syllables, is then issued into the four issue lanes of the pipeline. If an instruction has fewer than four syllables, one or more no-operation (NOP) instructions are issued into the unused issue lanes. In the exemplary embodiment that follows, two instruction issue unit buffers are used to buffer up to four 32-bit syllables each.
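
The NOP-filling issue step can be sketched as follows (illustrative only; the NOP encoding of zero is a placeholder, not the actual machine encoding, and lane rotation for unaligned bundles is ignored here):

    NOP = 0  # placeholder encoding, not the disclosed one

    def issue(bundle: list[int]) -> list[int]:
        """Issue a complete bundle of 1-4 syllables into the four issue lanes."""
        assert 1 <= len(bundle) <= 4
        return bundle + [NOP] * (4 - len(bundle))

    # A two-syllable bundle occupies two lanes; the other two receive NOPs.
    assert issue([0xA0, 0xA1]) == [0xA0, 0xA1, NOP, NOP]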

[0085] However, it should be understood that the selection of these values is by way of example only and should not be construed to limit the scope of the present invention. Those skilled in the art will recognize that other syllable sizes, buffer sizes and instruction sizes may be used. For example, in an alternate embodiment of the present invention, a syllable may comprise eight bits, sixteen bits, sixty-four bits, or the like, rather than thirty-two bits. Also, the instruction issue unit buffers may hold eight syllables, twelve syllables, sixteen syllables, or the like, instead of four syllables.

[0086] FIG. 5 is a block diagram illustrating selected portions of instruction issue unit 500 according to one embodiment of the present invention. Instruction issue unit 500 comprises instruction issue controller 550, registers 511, 521, 531 and 541, multiplexers (MUXs) 512, 522, 532 and 542, registers 513, 523, 533 and 543, and MUX 560. Registers 513, 523, 533, and 543 comprise a first instruction issue unit buffer, referred to hereafter as Instruction Issue Unit Buffer 0 (IIUB0). Registers 511, 521, 531, and 541 comprise a second instruction issue unit buffer, referred to hereafter as Instruction Issue Unit Buffer 1 (IIUB1).

[0087] The alignment of cache accesses to instruction cache 215 is determined by the branch target alignment. Each cache access after an access to a branch target fetches four syllables using the same alignment until the next taken branch occurs or a cache line boundary is crossed. Each line of the cache is organized as four independently addressable cache banks aligned with the four issue lanes. The first cache bank holds Syllable 0 and is aligned with the first issue lane, referred to as Issue Lane 0. The second cache bank holds Syllable 1 and is aligned with the second issue lane, referred to as Issue Lane 1. The third cache bank holds Syllable 2 and is aligned with the third issue lane, referred to as Issue Lane 2. The fourth cache bank holds Syllable 3 and is aligned with the fourth issue lane, referred to as Issue Lane 3.

[0088] Since there is no guarantee that the first syllable of an instruction is aligned to a particular cache bank or issue lane, a branch address may access an instruction aligned starting in any issue lane and cache bank. Thus, a four-syllable instruction may begin in the third cache bank (i.e., the Syllable 2 position) and be aligned to Issue Lane 2. For example, if Instruction A comprises four syllables A0, A1, A2 and A3, the four syllables may be fetched into Issue Lane 2, Issue Lane 3, Issue Lane 0, and Issue Lane 1, respectively. A branch instruction is always indicated by the first syllable in an instruction bundle. Hence, the outputs of registers 513, 523, 533 and 543 are input to separate channels of multiplexer (MUX) 560 and are individually selected by the START OF BUNDLE control signal.
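
With 4-byte syllables and four banks, the bank (and hence issue lane) of a syllable follows from bits [3:2] of its byte address. The sketch below illustrates the Instruction A example (the branch target address is hypothetical):

    def lane_of(byte_address: int) -> int:
        # Bits [3:2] of the byte address select one of the four cache banks,
        # each of which is aligned with the issue lane of the same number.
        return (byte_address >> 2) & 0x3

    # A four-syllable instruction starting at the Syllable 2 position enters
    # Issue Lanes 2, 3, 0 and 1, as in the Instruction A example above.
    start = 0x48  # hypothetical branch target in the Syllable 2 bank
    assert [lane_of(start + 4 * i) for i in range(4)] == [2, 3, 0, 1]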

[0089] Instruction issue controller 550 controls the transfer of Syllable 3, Syllable 2, Syllable 1 and Syllable 0 from instruction cache 215 to Issue Lane 3, Issue Lane 2, Issue Lane 1, and Issue Lane 0, respectively. A Stop bit is used in the highest syllable of an instruction bundle to indicate the end of the bundle. Thus, in a three-syllable instruction bundle comprising syllables A0, A1 and A2, the Stop bit is in syllable A2.
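
Bundle boundaries can thus be recovered by scanning for Stop bits (a behavioral sketch; the choice of bit 31 as the Stop bit position is an assumption for illustration, not the disclosed encoding):

    STOP_BIT = 1 << 31  # assumed position, for illustration only

    def split_bundles(syllables: list[int]) -> list[list[int]]:
        bundles, current = [], []
        for s in syllables:
            current.append(s)
            if s & STOP_BIT:  # Stop bit marks the last syllable of a bundle
                bundles.append(current)
                current = []
        return bundles

    # A three-syllable bundle A0, A1, A2 carries its Stop bit in A2.
    a0, a1, a2, b0 = 0x1, 0x2, STOP_BIT | 0x3, STOP_BIT | 0x4
    assert split_bundles([a0, a1, a2, b0]) == [[a0, a1, a2], [b0]]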

[0090] Ideally, each of the four syllables in a cache fetch is loaded from instruction cache 215 directly into the empty registers in instruction issue unit (IIU) buffer 0 (i.e., registers 513, 523, 533 and 543). In such a case, instruction issue controller 550 sets the MUX CONTROL signal to switch all four syllables to the inputs of registers 513, 523, 533 and 543. Instruction issue controller 550 also selectively enables each of registers 513, 523, 533 and 543 using individual Load Enable 2 (LE2) signals.

[0091] However, if previously fetched syllables are still in one or more of registers 513, 523, 533 and 543 when the next instruction bundle is fetched, instruction issue controller 550 sets the individual MUX CONTROL signals to selectively switch the corresponding ones of the four syllables in the next instruction to the inputs of registers 511, 521, 531 and 541 (i.e., Instruction Issue Unit (IIU) Buffer 1). Instruction issue controller 550 also selectively enables each of registers 511, 521, 531 and 541 using individual Load Enable 1 (LE1) signals. Thus, a syllable may be delayed temporarily in IIU Buffer 1 until the corresponding register in IIU Buffer 0 becomes empty.
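
The per-lane decision described in the two preceding paragraphs can be summarized behaviorally as follows (a sketch, not RTL; the register and signal behavior is abstracted, the names are our own, and advance() is assumed to run before accept() in each cycle):

    class Lane:
        """One issue lane: an IIUB0 register with an IIUB1 register behind it."""
        def __init__(self):
            self.iiub0 = None  # forward-most register, feeds the issue lane
            self.iiub1 = None  # second-level buffer

        def advance(self):
            # MUX selects the IIUB1 output and LE2 loads it into IIUB0.
            if self.iiub0 is None and self.iiub1 is not None:
                self.iiub0, self.iiub1 = self.iiub1, None

        def accept(self, syllable) -> bool:
            """Accept one fetched syllable; False means it must be refetched."""
            if self.iiub0 is None:
                self.iiub0 = syllable   # MUX selects the cache output; LE2 loads
            elif self.iiub1 is None:
                self.iiub1 = syllable   # LE1 loads the second-level buffer
            else:
                return False            # both registers full (the G0 case below)
            return True

    lane = Lane()
    assert lane.accept("F0") and lane.accept("G0")
    assert not lane.accept("H0")  # hypothetical third syllable: fetch retried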

[0092] The operation of instruction issue unit 500 may best be understood with reference to FIG. 6 and FIGS. 7A-7D. FIG. 6 is a block diagram illustrating the contents of instruction cache 215 in exemplary data processor 100 according to one embodiment of the present invention. FIGS. 7A-7D are block diagrams illustrating the flow of instruction bundles and syllables through Instruction Issue Unit Buffer 0 (IIUB0) and Instruction Issue Unit Buffer 1 (IIUB1) according to one embodiment of the present invention.

[0093] Instruction cache 215 contains an exemplary sequence of seven instruction bundles, referred to as Instructions A, B, C, D, E, F and G, within a single cache line. Instruction A comprises two syllables, A0 and A1. Instruction B comprises one syllable, B0. Instruction C comprises two syllables, C0 and C1. Instruction D comprises one syllable, D0. Instruction E comprises one syllable, E0. Instruction F comprises four syllables, F0, F1, F2 and F3. Finally, Instruction G comprises one syllable, G0.

[0094] Initially, IIUB0 and IIUB1 are empty and a branch instruction begins fetching syllables in groups of four beginning at syllable A0. Since IIUB0 is empty, instruction issue controller 550 sets MUX 512, MUX 522, MUX 532 and MUX 542 so that the first four syllables, A0, A1, B0 and C0, are fetched into IIUB0. Syllable A0 is in the Syllable 2 slot in FIG. 5 and therefore is aligned with Issue Lane 2 (register 523). Correspondingly, Syllable A1 is aligned with Issue Lane 3 (register 513), Syllable B0 is aligned with Issue Lane 0 (register 543), and Syllable C0 is aligned with Issue Lane 1 (register 533).

[0095] FIG. 7A shows the positions of A0, A1, B0 and C0 after they are loaded into IIUB0. After A0, A1, B0 and C0 are loaded into IIUB0, instruction issue controller 550 detects the Stop bit in A1, indicating that all of Instruction A has been fetched, and issues A0 and A1 into Issue Lanes 2 and 3. Syllables B0 and C0 remain in IIUB0. IIUB1 (i.e., registers 511, 521, 531 and 541) is still empty.

[0096] At this point, the next four syllables (C1, D0, E0 and F0) are fetched. Since IIUB0 is only partially empty, instruction issue controller 550 sets MUX 512 and MUX 522 so that syllables C1 and D0 are fetched into IIUB0 by the LE2 signal. Instruction issue controller 550 also sets MUX 532 and MUX 542 so that syllables E0 and F0 can only be fetched into IIUB1 by the LE1 signal. FIG. 7B shows the positions of C1, D0, E0 and F0 after they are loaded into IIUB0 and IIUB1. After C1, D0, E0 and F0 are loaded, instruction issue controller 550 detects the Stop bit in B0, indicating that all of Instruction B has been fetched, and issues B0 into Issue Lane 0. Syllables C0, C1 and D0 remain in IIUB0. IIUB1 contains E0 and F0.

[0097] At this point, the next four syllables (F1, F2, F3 and G0) are fetched. Since register 543 in IIUB0 is empty after syllable B0 is issued into Issue Lane 0, instruction issue controller 550 sets MUX 542 so that syllable E0 is transferred from IIUB1 to IIUB0 by the LE2 signal. Instruction issue controller 550 also sets MUX 512, MUX 522 and MUX 542 so that syllables F1, F2 and F3 are fetched into IIUB1 by the LE1 signal. The LE1 signal is not applied to register 531, which still holds syllable F0 from the previous fetch. Therefore, syllable G0 is not written to register 531 in IIUB1. FIG. 7C shows the positions of E0, F1, F2, and F3 after they are loaded into IIUB0 and IIUB1. After E0, F1, F2 and F3 are loaded, instruction issue controller 550 detects the Stop bit in C1, indicating that all of Instruction C has been fetched, and issues C0 and C1 into Issue Lanes 1 and 2. Syllables D0 and E0 remain in IIUB0. IIUB1 contains F3, F0, F1 and F2.

[0098] At this point, the four syllables F1, F2, F3 and G0 are refetched in order to fetch G0, which was not loaded on the previous fetch. Since registers 533 and 523 in IIUB0 are empty after syllables C0 and C1 are issued, instruction issue controller 550 sets MUX 522 and MUX 532 so that syllables F0 and F1 are transferred from IIUB1 to IIUB0 by the LE2 signal. Instruction issue controller 550 also sets MUX 532 so that syllable G0 is fetched into IIUB1 by the LE1 signal. The LE1 signal is not applied to registers 541 and 511, which still hold syllables F3 and F2 from the previous fetch. The LE1 signal is also not applied to register 521, which is empty after syllable F1 is transferred to IIUB0. FIG. 7D shows the positions of F0, F1 and G0 after F0, F1 and G0 are loaded into IIUB0 and IIUB1. After F0, F1 and G0 are loaded, instruction issue controller 550 detects the Stop bit in D0, indicating that all of Instruction D has been fetched, and issues D0 into Issue Lane 3. Syllables E0, F0 and F1 remain in IIUB0. IIUB1 contains F3, G0 and F2.

[0099] As FIG. 6 and FIGS. 7A-7D demonstrate, instruction issue unit 500 continually fetches syllables as far “forward” as possible in IIUB0 and IIUB1. If IIUB0 and IIUB1 are empty, syllables are transferred directly into IIUB0, the “forward-most” instruction buffer. If a register in IIUB0 is not empty, the corresponding incoming syllable is instead loaded into IIUB1 and subsequently advances into IIUB0 when the corresponding register becomes empty. In alternate embodiments, one or more additional layers of buffering may be added by inserting additional banks of registers and multiplexers in front of IIUB0 and IIUB1.

[0100] By way of example, if a third layer of buffering is desired, a third instruction issue unit buffer, IIUB2, may be implemented by inserting a third register and a second multiplexer in each issue lane. For example, in Issue Lane 3, the output of the second multiplexer would be connected to the input of register 511, one input channel of the second multiplexer would be connected to the output of the third register, and the other input channel of the second multiplexer would be connected directly to the Syllable 3 output of instruction cache 215. The input of the third register also would be connected directly to the Syllable 3 output of instruction cache 215. The second multiplexer and the third register would be controlled by instruction issue controller 550 using a second multiplexer control signal (MUX CONT. 2) and a third load enable signal (LE3). Those skilled in the art will recognize that the present invention may be similarly extended to implement additional layers of instruction issue buffers (i.e., IIUB3, IIUB4, IIUB5 and so forth).
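
Behaviorally, the extension generalizes the per-lane sketch above to an arbitrary buffer depth (again illustrative only; the class is our own, and advance() is assumed to run before accept() in each cycle):

    class DeepLane:
        """One issue lane with a chain of buffer registers; index 0 feeds the lane."""
        def __init__(self, depth: int = 3):  # depth of 3 models IIUB0, IIUB1, IIUB2
            self.regs = [None] * depth

        def advance(self):
            # Compact toward the front: syllables shift forward as the forward
            # registers drain, preserving their order.
            live = [r for r in self.regs if r is not None]
            self.regs = live + [None] * (len(self.regs) - len(live))

        def accept(self, syllable) -> bool:
            # Park the incoming syllable in the first free register.
            for i, r in enumerate(self.regs):
                if r is None:
                    self.regs[i] = syllable
                    return True
            return False  # all buffers full: the syllable must be refetched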

[0101] Although the present invention has been described in detail, those skilled in the art should understand that they can make various changes, substitutions and alterations herein without departing from the spirit and scope of the invention in its broadest form.

What is claimed is:
1. A data processor comprising: an instruction execution pipeline comprising N processing stages; and an instruction issue unit capable of fetching into said instruction execution pipeline instructions fetched from an instruction cache associated with said data processor, each of said fetched instructions comprising from one to S syllables, said instruction issue unit comprising: a first buffer comprising S storage locations capable of receiving and storing said one to S syllables associated with said fetched instructions, each of said S storage locations capable of storing one of said one to S syllables of each fetched instruction; a second buffer comprising S storage locations capable of receiving and storing said one to S syllables associated with said fetched instructions, each of said S storage locations capable of storing one of said one to S syllables of each fetched instruction; and a controller capable of determining if a first one of said S storage locations in said first buffer is full, wherein said controller, in response to a determination that said first one of said S storage locations is full, causes a corresponding syllable in an incoming fetched instruction to be stored in a corresponding one of said S storage locations in said second buffer.
2. The data processor as set forth in claim 1 wherein S=4.
3. The data processor as set forth in claim 1 wherein S=8.
4. The data processor as set forth in claim 1 wherein S is a multiple of four.
5. The data processor as set forth in claim 1 wherein each of said one to S syllables comprises 32 bits.
6. The data processor as set forth in claim 1 wherein each of said one to S syllables comprises 16 bits.
7. The data processor as set forth in claim 1 wherein each of said one to S syllables comprises 64 bits.
8. The data processor as set forth in claim 1 wherein said controller is capable of determining when all of the syllables in one of said fetched instructions are present in said first buffer, wherein said controller, in response to a determination that all of said syllables are present, causes all of said syllables to be transferred from said first buffer to said instruction execution pipeline.
9. The data processor as set forth in claim 8 wherein said controller is capable of determining if a syllable in said first one of said S storage locations in said first buffer has been transferred from said first buffer to said instruction pipeline, wherein said controller, in response to a determination that said syllable has been transferred, causes said corresponding syllable stored in said corresponding one of said S storage locations in said second buffer to be transferred to said first one of said S storage locations in said first buffer.
10. The data processor as set forth in claim 9 further comprising a switching circuit controlled by said controller and operable to transfer syllables from said second buffer to said first buffer.
11. A processing system comprising: a data processor; a memory coupled to said data processor; and a plurality of memory-mapped peripheral circuits coupled to said data processor for performing selected functions in association with said data processor, wherein said data processor comprises: an instruction execution pipeline comprising N processing stages; and an instruction issue unit capable of fetching into said instruction execution pipeline instructions fetched from an instruction cache associated with said data processor, each of said fetched instructions comprising from one to S syllables, said instruction issue unit comprising: a first buffer comprising S storage locations capable of receiving and storing said one to S syllables associated with said fetched instructions, each of said S storage locations capable of storing one of said one to S syllables of each fetched instruction; a second buffer comprising S storage locations capable of receiving and storing said one to S syllables associated with said fetched instructions, each of said S storage locations capable of storing one of said one to S syllables of each fetched instruction; and a controller capable of determining if a first one of said S storage locations in said first buffer is full, wherein said controller, in response to a determination that said first one of said S storage locations is full, causes a corresponding syllable in an incoming fetched instruction to be stored in a corresponding one of said S storage locations in said second buffer.
12. The processing system as set forth in claim 11 wherein S=4.
13. The processing system as set forth in claim 11 wherein S=8.
14. The processing system as set forth in claim 11 wherein S is a multiple of four.
15. The processing system as set forth in claim 11 wherein each of said one to S syllables comprises 32 bits.
16. The processing system as set forth in claim 11 wherein each of said one to S syllables comprises 16 bits.
17. The processing system as set forth in claim 11 wherein each of said one to S syllables comprises 64 bits.
18. The processing system as set forth in claim 11 wherein said controller is capable of determining when all of the syllables in one of said fetched instructions are present in said first buffer, wherein said controller, in response to a determination that all of said syllables are present, causes all of said syllables to be transferred from said first buffer to said instruction execution pipeline.
19. The processing system as set forth in claim 18 wherein said controller is capable of determining if a syllable in said first one of said S storage locations in said first buffer has been transferred from said first buffer to said instruction pipeline, wherein said controller, in response to a determination that said syllable has been transferred, causes said corresponding syllable stored in said corresponding one of said S storage locations in said second buffer to be transferred to said first one of said S storage locations in said first buffer.
20. The processing system as set forth in claim 19 further comprising a switching circuit controlled by said controller and operable to transfer syllables from said second buffer to said first buffer.
21. For use in a data processor comprising an instruction execution pipeline comprising N processing stages, a method of fetching into the instruction execution pipeline instructions fetched from an instruction cache associated with the data processor, each of the fetched instructions comprising from one to S syllables, the method of fetching comprising the steps of: storing in a first buffer comprising S storage locations the one to S syllables associated with the fetched instructions, each of the S storage locations capable of storing one of the one to S syllables of each fetched instruction; determining if a first one of the S storage locations in the first buffer is full; and in response to a determination that the first one of the S storage locations is full, storing a corresponding syllable in an incoming fetched instruction in a corresponding one of S storage locations in a second buffer, wherein the second buffer comprises S storage locations, each of the S storage locations in the second buffer capable of storing one of the one to S syllables of each fetched instruction.
22. The method as set forth in claim 21 wherein S is a multiple of four.
23. The method as set forth in claim 21 wherein each of the one to S syllables comprises one of: a) 16 bits, b) 32 bits, and c) 64 bits.
24. The method as set forth in claim 21 further comprising the steps of: determining when all of the syllables in one of the fetched instructions are present in the first buffer; and in response to a determination that all of the syllables are present, transferring all of the syllables from the first buffer to the instruction execution pipeline.
25. The method as set forth in claim 24 further comprising the steps of: determining if a syllable in the first one of the S storage locations in the first buffer has been transferred from the first buffer to the instruction pipeline; and in response to a determination that the syllable has been transferred, transferring the corresponding syllable stored in the corresponding one of the S storage locations in the second buffer to the first one of the S storage locations in the first buffer.